Project - Featurization, Model Selection & Tuning - AIML - Aishik Sengupta


Data Description :

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw (unscaled) form. It has 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).

Domain :

Cement manufacturing

Context :

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Attribute Information :

  • Cement : measured in kg in a m3 mixture
  • Blast furnace slag : measured in kg in a m3 mixture
  • Fly ash : measured in kg in a m3 mixture
  • Water : measured in kg in a m3 mixture
  • Superplasticizer : measured in kg in a m3 mixture
  • Coarse Aggregate : measured in kg in a m3 mixture
  • Fine Aggregate : measured in kg in a m3 mixture
  • Age : day (1~365)
  • Concrete compressive strength measured in MPa

Learning Outcomes:

  • Exploratory Data Analysis
  • Building ML models for regression
  • Hyperparameter tuning

Objective :

Modelling of strength of high performance concrete using Machine Learning

Steps and tasks:

  1. Deliverable -1 (Exploratory data quality report reflecting the following) (20 marks)
    • Univariate analysis (5 marks) - data types and description of the independent attributes, which should include name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body and tails of the distributions, missing values, and outliers
    • Multivariate analysis (5 marks) - Bi-variate analysis between the predictor variables and between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Presence of leverage points. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes
    • Pick one strategy to address the presence of outliers and missing values and perform the necessary imputation (10 marks)
  2. Deliverable -2 (Feature Engineering techniques) (15 marks)
    • Identify opportunities (if any) to create a composite feature, drop a feature etc. (5 marks)
    • Decide on complexity of the model, should it be simple linear model in terms of parameters or would a quadratic or higher degree help (5 marks)
    • Explore for Gaussians. If the data is likely a mix of Gaussians, explore the individual clusters and present your findings in terms of the independent attributes and their suitability for predicting strength (5 marks)
  3. Deliverable -3 (create the model) (15 marks)
    • Obtain feature importance for the individual features and present your findings
  4. Deliverable -4 (Tuning the model) (20 marks)
    • Algorithms that you think will be suitable for this project (5 marks)
    • Techniques employed to squeeze that extra performance out of the model without making it overfit or underfit (5 marks)
    • Model performance range at 95% confidence level (10 marks)
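Deliverable-4 asks for a performance range at a 95% confidence level; one common way to obtain it is a bootstrap over resampled training sets. A minimal sketch on synthetic data (the data, model and sample sizes here are stand-ins, not the concrete dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# synthetic stand-in for the concrete data (hypothetical)
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# refit on bootstrap resamples of the training set, score on the fixed test set
scores = []
for i in range(200):
    Xb, yb = resample(X_tr, y_tr, random_state=i)  # sampling with replacement
    scores.append(LinearRegression().fit(Xb, yb).score(X_te, y_te))

# the 2.5th and 97.5th percentiles of the bootstrap scores give the 95% CI
lower, upper = np.percentile(scores, [2.5, 97.5])
print(f"95% CI for test R^2: [{lower:.3f}, {upper:.3f}]")
```

The same loop works for any of the models fitted later; only the estimator changes.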





In [266]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# various scaling algorithms
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from scipy.stats import zscore

from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold

from sklearn.model_selection import train_test_split # to split the dataset for training and testing
from sklearn.model_selection import cross_val_score # for performing cross validation of models
from sklearn.svm import SVC # Support vector machine model

from sklearn.utils import shuffle
from sklearn import metrics # to get various evaluation metrics
from sklearn.metrics import roc_auc_score # receiver operating curve score

from sklearn.metrics import accuracy_score # accuracy of prediction score
from sklearn.metrics import recall_score # recall score
from sklearn.metrics import precision_score # precision score
from sklearn.metrics import f1_score # f1 score

from sklearn.decomposition import PCA # performing principal components analysis

# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans

from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor

from time import time
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils import resample
from sklearn.metrics import roc_auc_score
In [2]:
# read the dataset to a dataframe
cm_df = pd.read_csv('concrete.csv')
In [3]:
# taking a look at the first 10 rows of the dataframe; several zero entries stand out
cm_df.head(10)
Out[3]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29
5 255.0 0.0 0.0 192.0 0.0 889.8 945.0 90 21.86
6 166.8 250.2 0.0 203.5 0.0 975.6 692.6 7 15.75
7 251.4 0.0 118.3 188.5 6.4 1028.4 757.7 56 36.64
8 296.0 0.0 0.0 192.0 0.0 1085.0 765.0 28 21.65
9 155.0 184.0 143.0 194.0 9.0 880.0 699.0 28 28.99
In [4]:
cm_df.shape # dimensions of the dataframe
Out[4]:
(1030, 9)

This is a regression problem with 8 predictor variables and one target variable.

In [5]:
# To check for null values
cm_df.isna().sum()
Out[5]:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64

There are no null values in the dataset provided to us

In [6]:
cm_df.info() # basic info such as datatype, value types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
  • All are numerical datatypes.
  • Eight variables have the 64-bit float datatype.
  • One variable (age) has the 64-bit integer datatype, which is appropriate since it counts the number of days.
In [7]:
# Getting the 5 point summary
cm_df.describe().T
Out[7]:
count mean std min 25% 50% 75% max
cement 1030.0 281.167864 104.506364 102.00 192.375 272.900 350.000 540.0
slag 1030.0 73.895825 86.279342 0.00 0.000 22.000 142.950 359.4
ash 1030.0 54.188350 63.997004 0.00 0.000 0.000 118.300 200.1
water 1030.0 181.567282 21.354219 121.80 164.900 185.000 192.000 247.0
superplastic 1030.0 6.204660 5.973841 0.00 0.000 6.400 10.200 32.2
coarseagg 1030.0 972.918932 77.753954 801.00 932.000 968.000 1029.400 1145.0
fineagg 1030.0 773.580485 80.175980 594.00 730.950 779.500 824.000 992.6
age 1030.0 45.662136 63.169912 1.00 7.000 28.000 56.000 365.0
strength 1030.0 35.817961 16.705742 2.33 23.710 34.445 46.135 82.6
  • Many zero values can be observed for the predictors ash, slag and superplastic
  • Except for slag, ash and age, the other columns have nearly equal mean and median values.
In [8]:
# Checking how many rows have at least one 0 value
(~cm_df.all(1)).sum() 
Out[8]:
805
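The one-liner above counts rows containing at least one zero; a per-column breakdown makes the zeros easier to attribute. A small sketch on a toy frame (values taken loosely from the head of the data):

```python
import pandas as pd

# toy frame with the same zero pattern seen in the real columns (hypothetical values)
df = pd.DataFrame({
    "slag": [0.0, 42.2, 0.0],
    "ash": [0.0, 124.3, 95.7],
    "superplastic": [0.0, 10.8, 5.5],
})

zero_counts = (df == 0).sum()             # zeros per column
rows_with_zero = (~df.all(axis=1)).sum()  # rows containing at least one zero
print(zero_counts)
print("rows with a zero:", rows_with_zero)
```

`df.all(axis=1)` is True only for rows where every value is non-zero, which is why its negation counts rows with at least one zero.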
In [8]:
# To check for correlation
colormap = plt.cm.magma
plt.figure(figsize=(20,20))

sns.heatmap(cm_df.corr(), linewidths=0.1, vmax=1.0,
           square=True, cmap=colormap, linecolor='white', annot=True)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c872aed748>
  • Strength has a good positive correlation with cement, superplastic and age
  • Water has a strong negative correlation with superplastic and fineagg
In [9]:
cm_df.corr()
Out[9]:
cement slag ash water superplastic coarseagg fineagg age strength
cement 1.000000 -0.275216 -0.397467 -0.081587 0.092386 -0.109349 -0.222718 0.081946 0.497832
slag -0.275216 1.000000 -0.323580 0.107252 0.043270 -0.283999 -0.281603 -0.044246 0.134829
ash -0.397467 -0.323580 1.000000 -0.256984 0.377503 -0.009961 0.079108 -0.154371 -0.105755
water -0.081587 0.107252 -0.256984 1.000000 -0.657533 -0.182294 -0.450661 0.277618 -0.289633
superplastic 0.092386 0.043270 0.377503 -0.657533 1.000000 -0.265999 0.222691 -0.192700 0.366079
coarseagg -0.109349 -0.283999 -0.009961 -0.182294 -0.265999 1.000000 -0.178481 -0.003016 -0.164935
fineagg -0.222718 -0.281603 0.079108 -0.450661 0.222691 -0.178481 1.000000 -0.156095 -0.167241
age 0.081946 -0.044246 -0.154371 0.277618 -0.192700 -0.003016 -0.156095 1.000000 0.328873
strength 0.497832 0.134829 -0.105755 -0.289633 0.366079 -0.164935 -0.167241 0.328873 1.000000
In [10]:
#cm_df.median()
In [11]:
#To impute missing values
#impute = SimpleImputer( missing_values = 0, strategy='median')
#impute = impute.fit(df[:,0:18])
#cm_df.iloc[:,0:9] = impute.fit_transform(cm_df.iloc[:,0:9])



Exploratory Data Analysis

In [12]:
# Doing a boxplot on the whole data to check
sns.boxplot( data=cm_df, orient= "h" )
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c873604148>
In [13]:
plt.figure(figsize = (15,66))
cols = cm_df.columns.values

i=0 # column counter 
j=1 # plot counter
k=1 # plot counter of each variable

while i < (len(cols) - 1):
    if k == 1:
        plt.subplot(9,2,j)
        sns.distplot(cm_df[cols[i]])
        j+=1
        k+=1
        plt.title(f'Distribution of {cols[i]}')
        plt.xlabel(f'{cols[i]}')
        plt.ylabel('Frequency')
    else:
        plt.subplot(9,2,j)
        sns.boxplot(cm_df[cols[i]], color='yellowgreen')
        j+=1
        q1, q3 = np.percentile(cm_df[cols[i]],[25,75])
        IQR = q3 - q1 
        plt.title(f'Boxplot of {cols[i]} \n \u03bc = {round(cm_df[cols[i]].mean(), 3)},  SE = {round(cm_df[cols[i]].std(),4)}, Median = {round(cm_df[cols[i]].median(),3)}, IQR = {round(IQR, 3)} ')
        plt.xlabel(f'{cols[i]}')
        plt.ylabel('Frequency')
        i+=1
        k=1
  • Outliers can be observed in slag, water, superplastic, fineagg and age.
  • A mix of gaussians can be observed in all predictors indicating that data was sourced from different places and consolidated here.
In [15]:
sns.pairplot(cm_df, diag_kind='kde')
Out[15]:
<seaborn.axisgrid.PairGrid at 0x18378dc7948>
  • Not much association can be gathered by looking at the pairplots.
  • There is a seemingly linear relationship b/w cement and strength.
  • Strength itself follows a near normal distribution.
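Before the cluster analysis later in the notebook, the mix-of-Gaussians hunch can also be probed directly with a mixture model. A sketch on synthetic 1-D bimodal data (sklearn's GaussianMixture, which is not used elsewhere in this notebook):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# two well-separated 1-D Gaussians, mimicking a bimodal predictor (synthetic)
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(10, 1, 300)]).reshape(-1, 1)

# fit a 2-component mixture and recover the component means
gm = GaussianMixture(n_components=2, random_state=0).fit(x)
means = sorted(gm.means_.ravel())
print(means)
```

On the real data the same idea applied per predictor (or on the full feature matrix) would indicate how many Gaussian components each distribution plausibly contains.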



Checking for outliers lying more than 3 standard deviations from the mean

In [14]:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(cm_df))
print(z)
[[1.33901711 1.60144087 0.84714393 ... 0.31296991 0.27973311 0.35501811]
 [1.07479007 0.36754132 1.09607803 ... 0.28226038 0.50146528 0.73710825]
 [0.29838379 0.85688789 0.64896501 ... 1.09337085 0.27973311 0.39514356]
 ...
 [0.04564488 0.4882354  0.56454507 ... 0.06589318 0.27973311 0.50678082]
 [0.58237302 0.41624406 0.84714393 ... 1.29254178 3.55306569 1.15238141]
 [2.47791487 0.85688789 0.84714393 ... 2.00382326 0.61233136 1.005654  ]]
In [15]:
print(np.where(z>3))
(array([  21,   44,   64,   66,  133,  149,  156,  157,  159,  198,  232,
        256,  262,  263,  270,  292,  297,  313,  323,  361,  393,  448,
        465,  484,  538,  564,  570,  601,  623,  632,  713,  720,  744,
        754,  755,  816,  838,  850,  878,  901,  918,  919,  951,  955,
        957,  990,  995, 1026, 1028], dtype=int64), array([1, 4, 7, 3, 7, 7, 4, 7, 7, 7, 4, 7, 7, 3, 7, 4, 7, 7, 7, 7, 7, 7,
       7, 7, 4, 1, 7, 7, 7, 7, 7, 7, 4, 7, 7, 4, 4, 7, 7, 7, 1, 7, 7, 4,
       7, 1, 7, 4, 7], dtype=int64))

Capping outliers in slag, water, fineagg and superplastic (clipping at the 1st and 99th percentiles as a practical proxy for the 3 SD cut-off)

In [17]:
for col in ['slag','water','fineagg','superplastic']:
    percentiles = cm_df[col].quantile([0.01,0.99]).values
    cm_df[col] = np.clip(cm_df[col], percentiles[0], percentiles[1])
    
#percentiles = cm_df['superplastic'].quantile([0.01,0.99]).values
#cm_df['superplastic'] = np.clip(cm_df['superplastic'], percentiles[0], percentiles[1])

Did not cap age: its values extend well beyond 3 SD, and domain research suggests ageing is an important driver of strength. It's best left untouched for the moment.
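For reference, a literal mean ± 3 SD cap would look like the sketch below (synthetic series; the cells above clip at the 1st/99th percentiles as a close practical proxy):

```python
import numpy as np
import pandas as pd

# synthetic column with one injected extreme outlier (hypothetical values)
rng = np.random.RandomState(42)
s = pd.Series(rng.normal(50, 10, 1000))
s.iloc[0] = 500.0

# cap at mean +/- 3 standard deviations
lo, hi = s.mean() - 3 * s.std(), s.mean() + 3 * s.std()
capped = s.clip(lo, hi)
print(f"cap range: [{lo:.1f}, {hi:.1f}], max after capping: {capped.max():.1f}")
```

Note that the mean and SD are themselves inflated by the outlier, which is one reason percentile-based clipping is often preferred.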

In [18]:
#percentiless = cm_df['fineagg'].quantile([0.023,0.977]).values
#percentiless
In [19]:
#cm_df['slag'] = np.clip(cm_df['slag'], percentiless[0], percentiless[1])
In [20]:
#cm_df = cm_df[(z < 3).all(axis=1)]
In [21]:
plt.figure(figsize = (15,66))
cols = cm_df.columns.values

i=0 # column counter 
j=1 # plot counter
k=1 # plot counter of each variable

while i < (len(cols) - 1):
    if k == 1:
        plt.subplot(9,2,j)
        sns.distplot(cm_df[cols[i]])
        j+=1
        k+=1
        plt.title(f'Distribution of {cols[i]}')
        plt.xlabel(f'{cols[i]}')
        plt.ylabel('Frequency')
    else:
        plt.subplot(9,2,j)
        sns.boxplot(cm_df[cols[i]], color='yellowgreen')
        j+=1
        q1, q3 = np.percentile(cm_df[cols[i]],[25,75])
        IQR = q3 - q1
        plt.title(f'Boxplot of {cols[i]} \n \u03bc = {round(cm_df[cols[i]].mean(), 3)},  SE = {round(cm_df[cols[i]].std(),4)}, Median = {round(cm_df[cols[i]].median(),3)}, IQR = {round(IQR, 3)} ')
        plt.xlabel(f'{cols[i]}')
        plt.ylabel('Frequency')
        i+=1
        k=1

The extreme outliers in the capped columns are now gone

In [23]:
cm_df.corr()
Out[23]:
cement slag ash water superplastic coarseagg fineagg age strength
cement 1.000000 -0.276196 -0.397467 -0.081787 0.073064 -0.109349 -0.227091 0.081946 0.497832
slag -0.276196 1.000000 -0.323827 0.106540 0.049619 -0.286267 -0.282075 -0.043105 0.137455
ash -0.397467 -0.323827 1.000000 -0.258997 0.402790 -0.009961 0.082344 -0.154371 -0.105755
water -0.081787 0.106540 -0.258997 1.000000 -0.668108 -0.179189 -0.447581 0.281015 -0.292509
superplastic 0.073064 0.049619 0.402790 -0.668108 1.000000 -0.259049 0.212209 -0.199051 0.366113
coarseagg -0.109349 -0.286267 -0.009961 -0.179189 -0.259049 1.000000 -0.175033 -0.003016 -0.164935
fineagg -0.227091 -0.282075 0.082344 -0.447581 0.212209 -0.175033 1.000000 -0.156839 -0.171102
age 0.081946 -0.043105 -0.154371 0.281015 -0.199051 -0.003016 -0.156839 1.000000 0.328873
strength 0.497832 0.137455 -0.105755 -0.292509 0.366113 -0.164935 -0.171102 0.328873 1.000000



Composite Features

Added the new features below after some domain research: wc_ratio (the water-cement ratio) and wb_ratio (water as a fraction of water plus cement)

In [24]:
cm_df['wc_ratio'] = cm_df['water'] / cm_df['cement']
cm_df['wb_ratio'] = cm_df['water'] / ( cm_df['water'] + cm_df['cement'] )

Not dropping columns at the moment

In [25]:
colormap = plt.cm.magma
plt.figure(figsize=(20,20))

sns.heatmap(cm_df.corr(), linewidths=0.1, vmax=1.0,
           square=True, cmap=colormap, linecolor='white', annot=True)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c874d30148>

Thus we can see a strong inverse correlation between strength and the new features (roughly -0.50 and -0.54)

In [26]:
cm_df.corr()
Out[26]:
cement slag ash water superplastic coarseagg fineagg age strength wc_ratio wb_ratio
cement 1.000000 -0.276196 -0.397467 -0.081787 0.073064 -0.109349 -0.227091 0.081946 0.497832 -0.880291 -0.934950
slag -0.276196 1.000000 -0.323827 0.106540 0.049619 -0.286267 -0.282075 -0.043105 0.137455 0.360059 0.319137
ash -0.397467 -0.323827 1.000000 -0.258997 0.402790 -0.009961 0.082344 -0.154371 -0.105755 0.246627 0.289842
water -0.081787 0.106540 -0.258997 1.000000 -0.668108 -0.179189 -0.447581 0.281015 -0.292509 0.329273 0.353625
superplastic 0.073064 0.049619 0.402790 -0.668108 1.000000 -0.259049 0.212209 -0.199051 0.366113 -0.215673 -0.250459
coarseagg -0.109349 -0.286267 -0.009961 -0.179189 -0.259049 1.000000 -0.175033 -0.003016 -0.164935 -0.032512 0.024655
fineagg -0.227091 -0.282075 0.082344 -0.447581 0.212209 -0.175033 1.000000 -0.156839 -0.171102 0.072314 0.066787
age 0.081946 -0.043105 -0.154371 0.281015 -0.199051 -0.003016 -0.156839 1.000000 0.328873 -0.029111 -0.012631
strength 0.497832 0.137455 -0.105755 -0.292509 0.366113 -0.164935 -0.171102 0.328873 1.000000 -0.501573 -0.539898
wc_ratio -0.880291 0.360059 0.246627 0.329273 -0.215673 -0.032512 0.072314 -0.029111 -0.501573 1.000000 0.978721
wb_ratio -0.934950 0.319137 0.289842 0.353625 -0.250459 0.024655 0.066787 -0.012631 -0.539898 0.978721 1.000000
In [27]:
sns.pairplot(cm_df, diag_kind='kde')
Out[27]:
<seaborn.axisgrid.PairGrid at 0x1c873df9548>

The pairplots justify the addition of new features



Decide on the complexity of model

Order 1 complexity

Linear Models

In [28]:
X = cm_df.drop(['strength'], axis = 1)
y = cm_df['strength']
In [29]:
X_train, X_test, y_train, y_test = train_test_split(X , y ,test_size = 0.30 , random_state = 1234)
In [30]:
scale = StandardScaler() # Standard scaling
scale.fit(X_train.iloc[:,:]) # fit on the training data only so no information leaks from the test set
Out[30]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [31]:
X_train.iloc[:,:] = scale.transform(X_train.iloc[:,:])
X_test.iloc[:,:] = scale.transform(X_test.iloc[:,:])
C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexing.py:966: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
In [32]:
X_train
Out[32]:
cement slag ash water superplastic coarseagg fineagg age wc_ratio wb_ratio
658 2.278479 -0.881952 -0.825973 0.371583 -1.049357 1.940231 -2.011195 -0.266101 -1.209933 -1.482620
356 -1.384736 1.046448 1.155415 -0.427981 0.673485 -0.656994 -0.112343 -0.266101 1.637405 1.527810
627 0.978932 -0.643878 0.652205 -1.133478 1.345394 -0.464131 0.906246 -0.678099 -1.066152 -1.237705
415 0.360235 1.072636 -0.825973 0.442132 -0.256849 -1.338444 0.013409 -0.266101 -0.476926 -0.380777
307 1.958300 -0.572456 1.108239 1.359279 -0.377448 -1.184154 -0.929729 -0.266101 -0.996869 -1.125285
... ... ... ... ... ... ... ... ... ... ...
279 1.468616 1.225003 -0.825973 0.230483 0.862998 -1.146867 -0.817810 -0.266101 -1.009484 -1.145496
689 -0.831958 -0.881952 0.754420 -0.728993 0.242775 0.415326 1.672075 0.195336 0.333918 0.514979
664 -1.036307 0.498877 -0.825973 0.512682 -1.049357 -0.350984 1.054634 -0.612179 1.149961 1.195605
723 -1.092809 0.015587 1.800152 -0.804246 0.311689 0.426897 -0.035634 0.195336 0.747771 0.882411
815 -0.951554 -0.343905 1.092514 -0.512641 0.363374 1.096776 0.097663 -0.266101 0.600256 0.757170

721 rows × 10 columns

Linear Regression

In [33]:
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

for idx, col in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col, reg_model.coef_[idx]))
The coefficient for cement is 5.6341173172644305
The coefficient for slag is 8.780376052692777
The coefficient for ash is 5.711588826921984
The coefficient for water is -0.8710858071124834
The coefficient for superplastic is 1.9842179457029168
The coefficient for coarseagg is 1.6279634080073284
The coefficient for fineagg is 1.5473928823046714
The coefficient for age is 7.176523957331023
The coefficient for wc_ratio is 4.261109440728485
The coefficient for wb_ratio is -11.915885834469142
In [34]:
reg_model.coef_
Out[34]:
array([  5.63411732,   8.78037605,   5.71158883,  -0.87108581,
         1.98421795,   1.62796341,   1.54739288,   7.17652396,
         4.26110944, -11.91588583])
In [35]:
intercept = reg_model.intercept_

print("The intercept for our model is {}".format(intercept))
The intercept for our model is 35.81976421636617

Ridge Regression

In [36]:
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [  6.10279243   8.72429259   5.62958604  -1.0709561    2.00912053
   1.56858618   1.50562527   7.16999823   3.56188603 -10.66251302]

Lasso Regression

In [37]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train,y_train)
print ("Lasso model:", (lasso.coef_))
Lasso model: [ 6.53818173  7.00141464  3.95614355 -2.92338028  2.05989693  0.13562283
 -0.          6.97081241 -0.         -4.79784305]



Decision Tree Regression

In [38]:
dt_model = DecisionTreeRegressor( max_depth=10)
In [39]:
dt_model.fit(X_train, y_train)
Out[39]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=10,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
In [40]:
dt_model.score(X_test, y_test)
Out[40]:
0.791617587346187
In [41]:
dt_model.score(X_train, y_train)
Out[41]:
0.9818606947345222



Comparing the scores

In [42]:
#Linear Regression
print(reg_model.score(X_train, y_train))
print(reg_model.score(X_test, y_test))
0.6469072965023624
0.5693919307295272
In [43]:
# Ridge Regression
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.6468870351477615
0.5693373180199565
In [44]:
# Lasso Regression
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
0.6443160333276005
0.5661724346168292
In [45]:
rfTree = RandomForestRegressor(n_estimators=50)
rfTree.fit(X_train,y_train)
print("rfTree on train data ", rfTree.score(X_train,y_train))
print("rfTree on test data ", rfTree.score(X_test, y_test))
rfTree on train data  0.9887542518935672
rfTree on test data  0.876199983678959
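Deliverable-3 asks for feature importances; a fitted random forest exposes them directly via `feature_importances_`. A minimal sketch on synthetic data (the feature names and coefficients below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data where feature 0 dominates the target by construction
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.05, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# impurity-based importances, normalized to sum to 1
for name, imp in zip(["f0", "f1", "f2"], rf.feature_importances_):
    print(name, round(imp, 3))
```

On the concrete data the same attribute on the fitted `rfTree` would rank cement, age and the ratio features.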



Now using a quadratic model to capture non-linear interactions

In [46]:
#X_train = X_train.drop(['coarseagg','wc_ratio'], axis = 1 , inplace= True)
#X_test = X_test.drop(['coarseagg','wc_ratio'], axis = 1 , inplace= True)
In [47]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 2, interaction_only=True)
In [48]:
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test) # transform only; poly was already fitted on the training data
In [49]:
X_train.shape
Out[49]:
(721, 56)
In [50]:
X_test.shape
Out[50]:
(309, 56)
In [51]:
#X_train, X_test, y_train, y_test = train_test_split(X_poly , y ,test_size = 0.30 , random_state = 1)
In [52]:
X_train
Out[52]:
array([[ 1.        ,  2.27847859, -0.88195206, ...,  0.32196465,
         0.39452705,  1.79387161],
       [ 1.        , -1.38473574,  1.04644769, ..., -0.43571536,
        -0.40655207,  2.50164359],
       [ 1.        ,  0.97893212, -0.64387802, ...,  0.72295596,
         0.83928593,  1.31958115],
       ...,
       [ 1.        , -1.03630662,  0.49887739, ..., -0.70398171,
        -0.73192408,  1.37489822],
       [ 1.        , -1.09280864,  0.01558708, ...,  0.14606654,
         0.17236659,  0.65984154],
       [ 1.        , -0.95155359, -0.34390472, ..., -0.1597288 ,
        -0.2014838 ,  0.45449569]])

Linear Regression

In [54]:
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
Out[54]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Ridge Regression

In [55]:
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 0.00000000e+00  4.86407178e+00  8.84438554e+00  5.50734299e+00
 -1.33216121e+00  2.77529962e+00 -8.13692877e-01  2.49274301e-01
  1.36380410e+01 -4.11742607e+00 -4.07069862e+00  3.58383215e+00
 -6.99155989e-01 -1.16421409e+01 -3.61324007e+00  2.75413940e+00
 -7.20090611e-01 -2.41233066e+00 -1.26091667e+00 -2.68403739e+00
  2.25025985e+00 -5.44196772e+00 -3.50147383e+00 -1.68879332e-01
  2.36007432e+00  4.60201963e+00 -2.09185688e+00  3.94409013e+00
 -3.62556704e+00 -4.49906139e+00 -8.88500088e-04  2.41825375e+00
  7.30699639e+00  2.48412389e+00 -5.11367055e+00 -1.66378005e+00
 -3.38592697e+00 -2.80982978e+00  9.90960640e-01  1.56345976e+00
 -7.71857383e+00 -8.45892895e-01 -3.27142952e+00  8.52223734e-01
 -1.59329779e+01  1.51373078e+01  7.16639525e-01  1.49193536e-01
 -7.57045251e+00  9.24544480e+00  2.65401056e+00 -1.45142884e+00
 -1.27946156e+00  7.27966689e+00 -1.43834453e+01 -2.57484280e+00]

Lasso Regression

In [56]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train,y_train)
print ("Lasso model:", (lasso.coef_))
Lasso model: [ 0.          4.25461628  7.53528911  3.95422699 -0.76158595  4.21237467
 -0.14363106 -0.         12.98374653 -1.45873775 -6.08710818 -0.
 -0.         -2.73569552 -1.79927493  0.          0.          0.
 -0.         -0.03657554  1.0332708  -0.          0.29933181 -0.04600091
  2.26212816  1.21653717  0.          0.         -1.10823809 -1.03477316
  0.09561806  1.61638254  2.78656061 -0.37544603 -0.          1.63591822
 -0.1692384  -0.50896931 -1.94262515 -0.          0.          1.35246674
 -0.          3.01700855 -1.09369201  0.          0.56073743 -0.28053589
  0.01724896  0.          0.         -0.         -0.         -0.
 -0.         -0.        ]
C:\Users\user\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 37.334133747935994, tolerance: 20.55749652599168
  positive)



Comparing the scores

In [57]:
#Linear Regression
print(reg_model.score(X_train, y_train))
print(reg_model.score(X_test, y_test))
0.7862700966396026
0.7020362275887368
In [58]:
# Ridge Regression
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.786799885318611
0.7129421087004818
In [59]:
# Lasso Regression
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
0.7662950423534712
0.7168554958388105
In [60]:
rfTree = RandomForestRegressor(n_estimators=50)
rfTree.fit(X_train,y_train)
print("rfTree on train data ", rfTree.score(X_train,y_train))
print("rfTree on test data ", rfTree.score(X_test, y_test))
rfTree on train data  0.9875069234910384
rfTree on test data  0.8688966939157818



Decision Tree Regression

In [61]:
dt_model = DecisionTreeRegressor( max_depth=5)
In [62]:
dt_model.fit(X_train, y_train)
Out[62]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=5,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
In [63]:
dt_model.score(X_test, y_test)
Out[63]:
0.7516005752919707
In [64]:
dt_model.score(X_train, y_train)
Out[64]:
0.8633678728109131
In [65]:
X_train.shape
Out[65]:
(721, 56)

Although Linear, Lasso and Ridge gave better results with the order-2 (quadratic) features, order 1 is largely sufficient to capture the data. The overfitting seen in the tree-based models will be addressed when tuning.
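The gap between train and test scores for the tree-based models can be quantified with k-fold cross-validation before tuning. A sketch on synthetic data (KFold and cross_val_score are already imported at the top of the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the concrete data (hypothetical)
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = X @ np.arange(1.0, 6.0) + rng.normal(0, 0.1, size=300)

# compare tree depths by mean cross-validated R^2 rather than a single split
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for depth in (3, 10, None):  # shallow, deep, unconstrained
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    results[depth] = cross_val_score(model, X, y, cv=cv).mean()
    print(depth, round(results[depth], 3))
```

A depth whose cross-validated score is close to its training score is neither over- nor underfitting, which is the balance Deliverable-4 asks for.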





Exploring K-Means clusters suggested by the mix of Gaussians

In [66]:
cm_df_z = cm_df.apply(zscore) # standardised score
In [223]:
# cluster analysis using Kmeans
cluster_range = range( 2, 15 )
cluster_errors = []
for num_clusters in cluster_range:
  clusters = KMeans( num_clusters, n_init = 10 )
  clusters.fit(cm_df_z)
  labels = clusters.labels_
  centroids = clusters.cluster_centers_
  cluster_errors.append( clusters.inertia_ )
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
In [224]:
clusters_df # checking errors for various clusters created
Out[224]:
num_clusters cluster_errors
0 2 8844.583630
1 3 7407.696209
2 4 6254.043929
3 5 5539.805511
4 6 5149.636052
5 7 4799.982253
6 8 4458.046025
7 9 4136.787486
8 10 3884.845420
9 11 3693.858565
10 12 3545.683089
11 13 3392.737251
12 14 3258.214231
In [225]:
# Elbow plot to find optimal cluster
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Out[225]:
[<matplotlib.lines.Line2D at 0x259c317ec88>]

From the elbow plot, 4 or 5 clusters seem reasonable

Taking 5 clusters
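A silhouette score can help arbitrate between 4 and 5 clusters more objectively than the elbow plot alone. A sketch on synthetic blobs (silhouette_score is an assumption here, not used elsewhere in the notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with 5 true clusters (hypothetical stand-in for cm_df_z)
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=0)

# higher silhouette (closer to 1) means tighter, better-separated clusters
scores = {}
for k in (4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```

Running the same loop on `cm_df_z` would give a direct numeric tie-breaker for the 4-vs-5 choice.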

In [227]:
kmeans = KMeans(n_clusters= 5)
kmeans.fit(cm_df_z)
Out[227]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [228]:
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
[195 235 228 327  45]
In [234]:
## creating a new dataframe only for labels and converting it into categorical variable
cluster_labels = pd.DataFrame(kmeans.labels_ , columns = list(['labels']))
cluster_labels['labels'] = cluster_labels['labels'].astype('category')
cm_df_labeled = cm_df.join(cluster_labels)

cm_df_labeled.boxplot(by = 'labels',  layout=(6,2), figsize=(20, 60))
Out[234]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000259C9280588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000259C80302C8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000259C9280888>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000259C80695C8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000259C809D688>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000259C80CF948>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000259C8103C08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000259C813B608>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000259C8143488>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000259C817C088>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000259C81DAE48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000259C8213C88>]],
      dtype=object)
In [235]:
prediction=kmeans.predict(cm_df_z)
cm_df_z["GROUP"] = prediction     # Creating a new column "GROUP" which will hold the cluster id of each record

cm_df_z_copy = cm_df_z.copy(deep = True)  # Creating a mirror copy for later re-use instead of building repeatedly

Showing centroid values across the predictors for the 5-cluster solution

In [237]:
centroids = kmeans.cluster_centers_
centroids
Out[237]:
array([[-1.09134967,  1.23992692, -0.20795751,  0.59880088, -0.40349589,
        -0.36661938, -0.17509195, -0.23156931, -0.5975245 ,  1.47377741,
         1.35886373],
       [ 0.59988632, -0.58768355, -0.8186048 ,  0.59647   , -1.012888  ,
         0.45907628, -0.17541653, -0.146896  , -0.40888229, -0.50066612,
        -0.45083946],
       [ 0.99151597,  0.43257658, -0.33666796, -0.85131499,  1.00835385,
        -0.64609892,  0.06176808, -0.2123521 ,  1.08643025, -0.98236603,
        -1.12792028],
       [-0.56040924, -0.60660861,  1.06362574, -0.40905138,  0.41276264,
         0.34482513,  0.36125989, -0.13549581, -0.19881107,  0.19633913,
         0.32461859],
       [ 0.64506844, -0.08770132, -0.84714393,  1.57606664, -1.0703929 ,
        -0.04087577, -1.26331759,  3.83111078,  0.66466084, -0.22118879,
        -0.17812468]])
In [238]:
centroid_df = pd.DataFrame(centroids, columns = list(cm_df) )
centroid_df
Out[238]:
cement slag ash water superplastic coarseagg fineagg age strength wc_ratio wb_ratio
0 -1.091350 1.239927 -0.207958 0.598801 -0.403496 -0.366619 -0.175092 -0.231569 -0.597524 1.473777 1.358864
1 0.599886 -0.587684 -0.818605 0.596470 -1.012888 0.459076 -0.175417 -0.146896 -0.408882 -0.500666 -0.450839
2 0.991516 0.432577 -0.336668 -0.851315 1.008354 -0.646099 0.061768 -0.212352 1.086430 -0.982366 -1.127920
3 -0.560409 -0.606609 1.063626 -0.409051 0.412763 0.344825 0.361260 -0.135496 -0.198811 0.196339 0.324619
4 0.645068 -0.087701 -0.847144 1.576067 -1.070393 -0.040876 -1.263318 3.831111 0.664661 -0.221189 -0.178125
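The z-scored centroids above are hard to read in physical units; multiplying by each column's standard deviation and adding back the mean recovers the original scale. A minimal standalone sketch of that idea (synthetic data stands in for cm_df here, so the numbers are illustrative only):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.cluster import KMeans

# small synthetic frame standing in for cm_df (unscaled)
rng = np.random.default_rng(1)
cm_df = pd.DataFrame(rng.normal(size=(100, 3)) * [10, 5, 2] + [300, 150, 40],
                     columns=['cement', 'water', 'strength'])

cm_df_z = cm_df.apply(zscore)                     # same scaling as in the cells above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(cm_df_z)

# z = (x - mean) / std  =>  x = z * std + mean  (ddof=0 matches scipy's zscore)
centroids_orig = (pd.DataFrame(kmeans.cluster_centers_, columns=cm_df.columns)
                  * cm_df.std(ddof=0) + cm_df.mean())
print(centroids_orig.round(1))
```

Because the z-transform is affine, each back-transformed centroid is exactly the mean of its cluster's points in the original units.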
In [245]:
cm_df_z.boxplot(by = 'GROUP', layout=(4,3), figsize=(15, 25))
Out[245]:
(grid of boxplots: each column of cm_df_z grouped by GROUP)

Visual analysis of the clusters against each predictor

In [242]:
var = 'cement'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[242]:
<seaborn.axisgrid.FacetGrid at 0x259c474b788>
In [246]:
var = 'water'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[246]:
<seaborn.axisgrid.FacetGrid at 0x259d3711688>
In [247]:
var = 'ash'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[247]:
<seaborn.axisgrid.FacetGrid at 0x259d105b408>
In [248]:
var = 'slag'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[248]:
<seaborn.axisgrid.FacetGrid at 0x259d10eb288>
In [249]:
var = 'coarseagg'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[249]:
<seaborn.axisgrid.FacetGrid at 0x259d117c508>
In [250]:
var = 'fineagg'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[250]:
<seaborn.axisgrid.FacetGrid at 0x259d117cd48>
In [251]:
var = 'age'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[251]:
<seaborn.axisgrid.FacetGrid at 0x259d11e4ec8>
In [252]:
var = 'superplastic'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[252]:
<seaborn.axisgrid.FacetGrid at 0x259d134bbc8>
In [253]:
var = 'wc_ratio'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[253]:
<seaborn.axisgrid.FacetGrid at 0x259d1347e88>
In [254]:
var = 'wb_ratio'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z,hue='GROUP')
plot.set(ylim = (-3,3))
Out[254]:
<seaborn.axisgrid.FacetGrid at 0x259d1485448>

Repeating the same steps with 4 clusters

In [255]:
cm_df_z2 = cm_df.apply(zscore)
kmeans = KMeans(n_clusters= 4, random_state= 1)
kmeans.fit(cm_df_z2)
Out[255]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)
In [256]:
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
[329 226 201 274]
In [257]:
prediction=kmeans.predict(cm_df_z2)
cm_df_z2["GROUP"] = prediction     # Creating a new column "GROUP" which will hold the cluster id of each record

cm_df_z2_copy = cm_df_z2.copy(deep = True)  # Creating a mirror copy for later re-use instead of building repeatedly
In [258]:
centroids = kmeans.cluster_centers_
centroids
Out[258]:
array([[-0.56172624, -0.60816455,  1.05201012, -0.4070082 ,  0.40374649,
         0.35462484,  0.36454138, -0.13738357, -0.20698553,  0.19809147,
         0.32660124],
       [ 0.98251332,  0.44133966, -0.34916728, -0.87076737,  1.02073177,
        -0.64146366,  0.07058531, -0.21371804,  1.08451634, -0.9820097 ,
        -1.12764122],
       [-1.08173439,  1.25208638, -0.2270377 ,  0.63240726, -0.42340327,
        -0.35472254, -0.21213132, -0.11504931, -0.58154949,  1.46408115,
         1.35298399],
       [ 0.65762239, -0.55228463, -0.80863119,  0.74301191, -1.01610919,
         0.36349797, -0.34032116,  0.42563643, -0.21938323, -0.50189128,
        -0.4545791 ]])
In [259]:
centroid_df = pd.DataFrame(centroids, columns = list(cm_df) )
centroid_df
Out[259]:
cement slag ash water superplastic coarseagg fineagg age strength wc_ratio wb_ratio
0 -0.561726 -0.608165 1.052010 -0.407008 0.403746 0.354625 0.364541 -0.137384 -0.206986 0.198091 0.326601
1 0.982513 0.441340 -0.349167 -0.870767 1.020732 -0.641464 0.070585 -0.213718 1.084516 -0.982010 -1.127641
2 -1.081734 1.252086 -0.227038 0.632407 -0.423403 -0.354723 -0.212131 -0.115049 -0.581549 1.464081 1.352984
3 0.657622 -0.552285 -0.808631 0.743012 -1.016109 0.363498 -0.340321 0.425636 -0.219383 -0.501891 -0.454579
In [260]:
cm_df_z2.boxplot(by = 'GROUP', layout=(4,3), figsize=(15, 25))
Out[260]:
(grid of boxplots: each column of cm_df_z2 grouped by GROUP)
In [261]:
var = 'cement'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[261]:
<seaborn.axisgrid.FacetGrid at 0x259d5f5e508>
In [262]:
var = 'water'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[262]:
<seaborn.axisgrid.FacetGrid at 0x259d53346c8>
In [263]:
var = 'ash'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[263]:
<seaborn.axisgrid.FacetGrid at 0x259d53a4248>
In [264]:
var = 'slag'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[264]:
<seaborn.axisgrid.FacetGrid at 0x259d5451348>
In [265]:
var = 'coarseagg'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[265]:
<seaborn.axisgrid.FacetGrid at 0x259d5401848>
In [266]:
var = 'fineagg'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[266]:
<seaborn.axisgrid.FacetGrid at 0x259d55851c8>
In [267]:
var = 'age'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[267]:
<seaborn.axisgrid.FacetGrid at 0x259d55ee708>
In [268]:
var = 'superplastic'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[268]:
<seaborn.axisgrid.FacetGrid at 0x259d5698e08>
In [269]:
var = 'wc_ratio'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[269]:
<seaborn.axisgrid.FacetGrid at 0x259d57153c8>
In [270]:
var = 'wb_ratio'

with sns.axes_style("white"):
    plot = sns.lmplot(var,'strength',data=cm_df_z2,hue='GROUP')
plot.set(ylim = (-3,3))
Out[270]:
<seaborn.axisgrid.FacetGrid at 0x259d57b12c8>

Age, wc_ratio and wb_ratio show a roughly linear relationship with strength; beyond that, the K-means clusters do not add much insight
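Before leaving clustering: the cluster counts tried above (4 and 5) can be compared quantitatively with silhouette scores. A standalone sketch, with well-separated synthetic blobs standing in for the standardized concrete data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic standardized-ish data with a known number of clusters (4)
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # in [-1, 1], higher = tighter clusters

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On the real cm_df_z the scores would likely be far lower than on these clean blobs, which is consistent with the weak separation seen in the boxplots above.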

Let's do some PCA

In [67]:
X = cm_df.drop(['strength'], axis = 1)
y = cm_df['strength']
In [68]:
scale2 = StandardScaler() # Standard scaling
scale2.fit(X.loc[:,:]) # no train/test split here: scaling the full predictor set for PCA

X_scaled = scale2.transform(X.loc[:,:])
In [69]:
#should give a 10*10 matrix (one row/column per predictor)
covMatrix = np.cov(X_scaled,rowvar=False)
print(covMatrix)
[[ 1.00097182 -0.27646443 -0.39785361 -0.08186691  0.07313521 -0.10945526
  -0.2273117   0.08202566 -0.88114648 -0.93585814]
 [-0.27646443  1.00097182 -0.32414217  0.10664354  0.04966695 -0.28654565
  -0.28234922 -0.04314733  0.36040875  0.3194472 ]
 [-0.39785361 -0.32414217  1.00097182 -0.25924864  0.40318139 -0.00997051
   0.08242408 -0.15452054  0.24686669  0.2901239 ]
 [-0.08186691  0.10664354 -0.25924864  1.00097182 -0.66875758 -0.17936283
  -0.44801565  0.28128783  0.32959288  0.35396885]
 [ 0.07313521  0.04966695  0.40318139 -0.66875758  1.00097182 -0.25930063
   0.2124157  -0.19924397 -0.21588229 -0.25070229]
 [-0.10945526 -0.28654565 -0.00997051 -0.17936283 -0.25930063  1.00097182
  -0.17520342 -0.00301881 -0.03254349  0.0246788 ]
 [-0.2273117  -0.28234922  0.08242408 -0.44801565  0.2124157  -0.17520342
   1.00097182 -0.15699178  0.07238409  0.06685143]
 [ 0.08202566 -0.04314733 -0.15452054  0.28128783 -0.19924397 -0.00301881
  -0.15699178  1.00097182 -0.02913914 -0.01264335]
 [-0.88114648  0.36040875  0.24686669  0.32959288 -0.21588229 -0.03254349
   0.07238409 -0.02913914  1.00097182  0.9796717 ]
 [-0.93585814  0.3194472   0.2901239   0.35396885 -0.25070229  0.0246788
   0.06685143 -0.01264335  0.9796717   1.00097182]]
In [71]:
#Performing PCA on all of its 10 predictors
pca = PCA(n_components=10)
pca.fit(X_scaled)
Out[71]:
PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [74]:
plt.bar(range(1,11), pca.explained_variance_ratio_, alpha = 0.5, align='center', label='individual explained variance')
plt.step( range(1,11), np.cumsum(pca.explained_variance_ratio_), where='mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()
In [75]:
df_comp = pd.DataFrame(pca.components_,columns=list(X.columns.values))
In [77]:
df_comp
Out[77]:
cement slag ash water superplastic coarseagg fineagg age wc_ratio wb_ratio
0 -0.506786 0.216623 0.131999 0.231541 -0.170517 -0.004158 0.000589 0.002229 0.540335 0.551495
1 0.214197 0.127233 -0.437872 0.506548 -0.483680 0.037705 -0.398490 0.298582 -0.056677 -0.055526
2 0.080235 0.635174 -0.227201 -0.003862 0.325882 -0.645892 -0.041513 -0.104991 0.020882 -0.043859
3 0.038601 -0.383328 0.101109 0.241325 -0.055190 -0.549525 0.477770 0.499129 0.018625 0.018538
4 0.054694 -0.045669 0.602285 0.139233 0.322844 -0.097687 -0.634786 0.308502 -0.028880 -0.007602
5 -0.145506 0.286288 -0.214977 -0.384503 0.204989 0.317607 0.093739 0.742082 0.014639 0.015057
6 -0.043093 0.311821 0.419644 -0.402546 -0.684498 -0.198130 0.007051 0.063435 -0.191491 -0.119298
7 0.560653 -0.169575 -0.071373 -0.390506 -0.098565 -0.097181 -0.147351 0.025029 0.672724 0.072580
8 0.468798 0.420038 0.372286 0.373057 0.054820 0.346773 0.421397 0.013455 0.130759 -0.088978
9 -0.359494 -0.005879 0.015680 0.102992 -0.015033 0.006151 -0.011871 0.000476 0.443475 -0.814126
In [76]:
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma')
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c87b7e4a88>
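Rather than hard-coding n_components=10 and eyeballing the cumulative-variance plot, PCA also accepts a float so it keeps just enough components to reach a target variance. A sketch under synthetic data (10 correlated columns built from 4 latent factors, standing in for X_scaled):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 4))                       # 4 latent factors
X = base @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(500, 10))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)        # keep the fewest components explaining >= 95% variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, pca.explained_variance_ratio_.sum().round(3))
```

On the concrete data this would give an objective cut-off to compare against the scree plot above.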





Building models to check feature importances

In [142]:
X = cm_df.drop(['strength'], axis = 1)
y = cm_df['strength']
In [143]:
X_train, X_test, y_train, y_test = train_test_split(X , y ,test_size = 0.20 , random_state = 1)
In [144]:
X_train, X_val, y_train, y_val = train_test_split(X_train , y_train ,test_size = 0.25 , random_state = 1)
In [145]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
(618, 10)
(206, 10)
(206, 10)
In [146]:
scale = StandardScaler() # Standard scaling
scale.fit(X_train.iloc[:,:])# fitting on training data so the data integrity in test is maintained
Out[146]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [147]:
X_train.iloc[:,:] = scale.transform(X_train.iloc[:,:])
X_val.iloc[:,:] = scale.transform(X_val.iloc[:,:])
X_test.iloc[:,:] = scale.transform(X_test.iloc[:,:])
C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexing.py:966: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
In [148]:
gbmTree = GradientBoostingRegressor(n_estimators=50)
gbmTree.fit(X_train,y_train)
print("gbmTree on train data ", gbmTree.score(X_train, y_train))
print("gbmTree on validation data ", gbmTree.score(X_val,y_val))
print("gbmTree on test data ", gbmTree.score(X_test,y_test))
gbmTree on train data  0.9221144729738221
gbmTree on validation data  0.8830770225897295
gbmTree on test data  0.8798721855021808
In [149]:
# View a list of the features and their importance scores
importances = gbmTree.feature_importances_
indices = np.argsort(importances)[::-1][:10]
features = cm_df.columns.drop(['strength'])
#plot it
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
Out[149]:
Text(0.5, 0, 'Relative Importance')
In [150]:
rfTree = RandomForestRegressor(n_estimators=50)
rfTree.fit(X_train,y_train)
print("rfTree on train data ", rfTree.score(X_train,y_train))
print("rfTree on validation data ", rfTree.score(X_val,y_val))
print("gbmTree on test data ", gbmTree.score(X_test,y_test))  # gbm test score repeated for comparison
rfTree on train data  0.9847465514499
rfTree on validation data  0.9085570734407855
gbmTree on test data  0.8798721855021808
In [151]:
# View a list of the features and their importance scores
importances = rfTree.feature_importances_
indices = np.argsort(importances)[::-1][:10]
features = cm_df.columns.drop(['strength'])
#plot it
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
Out[151]:
Text(0.5, 0, 'Relative Importance')
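Impurity-based feature_importances_ can be biased toward high-variance features; permutation importance on held-out data is a useful cross-check before deciding which predictors to drop. A standalone sketch (synthetic regression data stands in for the concrete split, so the feature indices here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 6 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=400, n_features=6, n_informative=3,
                       random_state=1, noise=5.0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

rf = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)
# shuffle each feature on the held-out set and measure the drop in R^2
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=1)
ranked = np.argsort(result.importances_mean)[::-1]
print("features ranked by permutation importance:", ranked)
```

Features whose permutation importance is near zero on held-out data are safer candidates for dropping than ones that merely rank low on impurity importance.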

Dropped the predictors below based on multicollinearity, cluster analysis, PCA and feature importances

Using a Train / Validation / Test split in a 3:1:1 ratio

In [152]:
X = cm_df.drop(['strength','coarseagg','ash','cement','wc_ratio'], axis = 1)
y = cm_df['strength']
In [153]:
X_train, X_test, y_train, y_test = train_test_split(X , y ,test_size = 0.20 , random_state = 1)
In [154]:
X_train, X_val, y_train, y_val = train_test_split(X_train , y_train ,test_size = 0.25 , random_state = 1)
In [155]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
(618, 6)
(206, 6)
(206, 6)
In [156]:
scale = StandardScaler() # Standard scaling
scale.fit(X_train.iloc[:,:])# fitting on training data so the data integrity in test is maintained
Out[156]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [157]:
X_train.iloc[:,:] = scale.transform(X_train.iloc[:,:])
X_val.iloc[:,:] = scale.transform(X_val.iloc[:,:])
X_test.iloc[:,:] = scale.transform(X_test.iloc[:,:])
C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexing.py:966: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
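The SettingWithCopyWarning above comes from assigning scaled values into DataFrame slices in place. Wrapping the scaler and estimator in an sklearn Pipeline avoids it, and also guarantees the scaler is re-fit only on the training folds during cross-validation. A minimal sketch, with synthetic data standing in for X and y:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# scaler is re-fit on each training fold; no manual slice assignment needed
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR(C=10))])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

The same Pipeline object can be passed straight to GridSearchCV (with parameter names like `svr__C`) in the tuning cells that follow.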

Using default-complexity (untuned) regression models: SVR, Random Forest, Bagging & Gradient Boosting

In [158]:
svreg = SVR()
rfr = RandomForestRegressor()
gbr = GradientBoostingRegressor()
br = BaggingRegressor()
In [159]:
# Checking the hyperparameters present in the models for clarity
for clf, label in zip([svreg , rfr, gbr, br], ['svreg','rfr','gbr','br']):
    print("model name: " , label)
    print("\n model_hyperparameters \n" , clf.get_params() )
model name:  svreg

 model_hyperparameters 
 {'C': 1.0, 'cache_size': 200, 'coef0': 0.0, 'degree': 3, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'shrinking': True, 'tol': 0.001, 'verbose': False}
model name:  rfr

 model_hyperparameters 
 {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
model name:  gbr

 model_hyperparameters 
 {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'ls', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
model name:  br

 model_hyperparameters 
 {'base_estimator': None, 'bootstrap': True, 'bootstrap_features': False, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 10, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Using Grid Search & Random Search to fine-tune the models

SVR using GridSearchCV

In [36]:
svreg_gs_param_grid = {
    'C' : [0.01, 0.1 , 1, 10,20, 30 , 50 , 100,200,400,500,1000],
    'gamma' : ['auto','scale'],
    'kernel' : ['poly','rbf']
}
In [37]:
# run grid search
grid_search = GridSearchCV(estimator=svreg, param_grid=svreg_gs_param_grid, cv=10)
#start = time()
grid_search.fit(X_train, y_train)
Out[37]:
GridSearchCV(cv=10, error_score=nan,
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='scale', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 20, 30, 50, 100, 200, 400, 500,
                               1000],
                         'gamma': ['auto', 'scale'],
                         'kernel': ['poly', 'rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [38]:
#Getting best parameters
grid_search.best_params_
Out[38]:
{'C': 200, 'gamma': 'auto', 'kernel': 'rbf'}
In [39]:
print(grid_search.cv_results_['mean_test_score'])
[3.47988343e-02 7.61155575e-04 3.50470351e-02 7.55305034e-04
 1.84919640e-01 1.71759226e-01 1.84844203e-01 1.71705041e-01
 3.81188790e-01 6.10556029e-01 3.80649744e-01 6.10474439e-01
 4.60060221e-01 7.95294659e-01 4.58999950e-01 7.95301092e-01
 4.38518721e-01 8.11662285e-01 4.38249006e-01 8.11622606e-01
 4.22030213e-01 8.20872271e-01 4.21104451e-01 8.20639339e-01
 3.70494397e-01 8.31791377e-01 3.71013857e-01 8.31533111e-01
 3.27876416e-01 8.41667129e-01 3.28991611e-01 8.41528267e-01
 3.06526649e-01 8.46303692e-01 3.06906034e-01 8.46021978e-01
 2.88475541e-01 8.40898420e-01 2.89172289e-01 8.40646366e-01
 2.81522938e-01 8.35630796e-01 2.80609053e-01 8.35444315e-01
 2.64387668e-01 8.16674276e-01 2.64131605e-01 8.15228184e-01]
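The raw mean_test_score array above is hard to scan; cv_results_ reads more easily as a DataFrame sorted by rank. A small standalone sketch with a tiny synthetic grid (not the SVR grid above):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, random_state=1, noise=10.0)
gs = GridSearchCV(SVR(), {'C': [1, 100, 10000], 'kernel': ['rbf']}, cv=5)
gs.fit(X, y)

# one row per candidate, best first
results = (pd.DataFrame(gs.cv_results_)
             [['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
             .sort_values('rank_test_score'))
print(results.to_string(index=False))
```

Seeing mean and std side by side also makes it obvious when two candidates are statistically indistinguishable.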
In [40]:
grid_search.best_estimator_
Out[40]:
SVR(C=200, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [162]:
svregressor = SVR( C=200, gamma='auto')
svregressor.fit(X_train, y_train)
print("SVR on train data ", svregressor.score(X_train,y_train))
print("SVR on validation data ", svregressor.score(X_val,y_val))
print("SVR on test data ", svregressor.score(X_test,y_test))
SVR on train data  0.9255483735044082
SVR on validation data  0.8817806614945835
SVR on test data  0.8784998667967238

Random Forest Regression using GridSearchCV

In [44]:
rfr_gs_param_grid = {
    'n_estimators' : [10, 50, 100, 200],
    'max_depth': range(5,10),
    'criterion': ['mse','mae'],
    'min_samples_leaf' : range(1,4),
    'max_features':['auto','sqrt']
}
In [45]:
# run grid search
grid_search = GridSearchCV(estimator=rfr, param_grid=rfr_gs_param_grid, cv=10)
#start = time()
grid_search.fit(X_train, y_train)
Out[45]:
GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'criterion': ['mse', 'mae'], 'max_depth': range(3, 6),
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': range(1, 4),
                         'n_estimators': [10, 50, 100, 200]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [46]:
# Getting the best parameters
grid_search.best_params_
Out[46]:
{'criterion': 'mse',
 'max_depth': 5,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'n_estimators': 50}
In [163]:
rfTree = RandomForestRegressor(n_estimators=50, max_depth=11, max_features='auto', min_samples_leaf=1, criterion='mse')  # max_depth raised beyond the grid-search best of 5
rfTree.fit(X_train,y_train)
print("Random Forest on train data ", rfTree.score(X_train,y_train))
print("Random Forest on validation data ", rfTree.score(X_val,y_val))
print("Random Forest on test data ", rfTree.score(X_test,y_test))
Random Forest on train data  0.9803517724084108
Random Forest on validation data  0.9009933563226278
Random Forest on test data  0.9096718884175511

Gradient Boosting Regression using GridSearchCV

In [63]:
gbr_gs_param_grid = {
    'n_estimators' : [50, 100],
    'max_depth': range(5,10),
    'criterion': ['mse','mae'],
    'min_samples_leaf' : range(1,4),
    'max_features':['auto','sqrt'],
    'learning_rate' : [0.001, 0.01, 0.05,0.1]
}
In [64]:
# run grid search
grid_search = GridSearchCV(estimator=gbr, param_grid=gbr_gs_param_grid, cv=10)
#start = time()
grid_search.fit(X_train, y_train)
Out[64]:
GridSearchCV(cv=10, error_score=nan,
             estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=3,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_iter_...
                                                 validation_fraction=0.1,
                                                 verbose=0, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'criterion': ['mse', 'mae'],
                         'learning_rate': [0.001, 0.01, 0.05, 0.1],
                         'max_depth': range(5, 10),
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': range(1, 4),
                         'n_estimators': [50, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [65]:
# Getting the best parameters
grid_search.best_params_
Out[65]:
{'criterion': 'mse',
 'learning_rate': 0.1,
 'max_depth': 6,
 'max_features': 'sqrt',
 'min_samples_leaf': 3,
 'n_estimators': 100}
In [164]:
gbmTree = GradientBoostingRegressor(n_estimators=100, criterion='mse', learning_rate=0.1, max_depth=6, max_features='sqrt',
                                    min_samples_leaf=3)
gbmTree.fit(X_train,y_train)
print("Gradient Boost on training data" , gbmTree.score(X_train, y_train))
print("Gradient Boost on validation data ",gbmTree.score(X_val,y_val))
print("Gradient Boost on test data ",gbmTree.score(X_test,y_test))
Gradient Boost on training data 0.9906563372330728
Gradient Boost on validation data  0.9193656757998036
Gradient Boost on test data  0.9247859092279108

Bagging Regression using GridSearchCV

In [26]:
br_gs_param_grid = {
    'n_estimators' : [50, 100, 200, 400, 500, 1000],
    'max_features':range(1,6),
    'bootstrap':[True, False]
}
In [27]:
# run grid search
grid_search = GridSearchCV(estimator=br, param_grid=br_gs_param_grid, cv=10)
#start = time()
grid_search.fit(X_train, y_train)
KeyboardInterrupt: the Bagging grid search (60 candidates x 10 folds, run single-threaded) was interrupted manually; full traceback elided.
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\ensemble\_bagging.py in _parallel_build_estimators(n_estimators, ensemble, X, y, sample_weight, seeds, total_n_estimators, verbose)
     92                                                       bootstrap, n_features,
     93                                                       n_samples, max_features,
---> 94                                                       max_samples)
     95 
     96         # Draw samples, using sample weights, and then fit

~\Anaconda3\lib\site-packages\sklearn\ensemble\_bagging.py in _generate_bagging_indices(random_state, bootstrap_features, bootstrap_samples, n_features, n_samples, max_features, max_samples)
     55                                         n_features, max_features)
     56     sample_indices = _generate_indices(random_state, bootstrap_samples,
---> 57                                        n_samples, max_samples)
     58 
     59     return feature_indices, sample_indices

~\Anaconda3\lib\site-packages\sklearn\ensemble\_bagging.py in _generate_indices(random_state, bootstrap, n_population, n_samples)
     36     # Draw sample indices
     37     if bootstrap:
---> 38         indices = random_state.randint(0, n_population, n_samples)
     39     else:
     40         indices = sample_without_replacement(n_population, n_samples,

KeyboardInterrupt: 
In [70]:
# Getting the best parameters
grid_search.best_params_
Out[70]:
{'bootstrap': True, 'max_features': 5, 'n_estimators': 500}
In [ ]:
print(grid_search.cv_results_['mean_test_score'])
In [165]:
bgcl = BaggingRegressor(n_estimators=500, bootstrap= True, max_features=5)
bgcl = bgcl.fit(X_train,y_train)
print("bgcl on train data ", bgcl.score(X_train,y_train))
print("bgcl on validation data ", bgcl.score(X_val,y_val))
print("bgcl on test data ", bgcl.score(X_test,y_test))
#print("out of bag score" , bgcl.oob_score_)
bgcl on train data  0.9715665325089957
bgcl on validation data  0.8761931966489279
bgcl on test data  0.8896196337416095
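The commented-out `oob_score_` line above only works if the regressor is constructed with `oob_score=True`. A minimal, self-contained sketch on synthetic data (`make_regression` stands in for the concrete dataset; `X_train`/`y_train` are not reused):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Synthetic stand-in for the concrete data
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0,
                                 random_state=0)

# oob_score=True makes the estimator keep an out-of-bag R^2 estimate,
# so bag.oob_score_ is available without a separate validation split
bag = BaggingRegressor(n_estimators=100, bootstrap=True, oob_score=True,
                       random_state=0)
bag.fit(X_demo, y_demo)
print("out-of-bag R^2:", bag.oob_score_)
```

The OOB estimate is a cheap alternative to a held-out split for bagged ensembles, since each tree is evaluated on the rows it never saw.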

Bagging Regression using RandomizedSearchCV

In [166]:
svreg = SVR()
rfr = RandomForestRegressor()
gbr = GradientBoostingRegressor()
br = BaggingRegressor()
In [167]:
br_rs_param_grid = {
    'n_estimators' : [50, 100, 200, 400, 500, 1000],
    'max_features':range(1,6),
    'bootstrap':[True, False]
}
In [168]:
# run randomized search
samples = 60  # number of random samples 
randomCV = RandomizedSearchCV(br, param_distributions=br_rs_param_grid, n_iter=samples) #default cv = 3
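Note that this grid has exactly 6 x 5 x 2 = 60 discrete combinations, so `n_iter=60` samples essentially the whole grid. The grid size can be checked with `ParameterGrid` (a sketch; the dictionary mirrors `br_rs_param_grid` above):

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'n_estimators': [50, 100, 200, 400, 500, 1000],
    'max_features': list(range(1, 6)),
    'bootstrap': [True, False],
}
# Total number of distinct hyperparameter settings in the grid
print(len(ParameterGrid(grid)))
```

When `n_iter` is at least the grid size, a randomized search over purely discrete distributions offers no speed advantage over an exhaustive grid search.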
In [169]:
randomCV.fit(X_train, y_train)

 
print(randomCV.best_params_)
{'n_estimators': 1000, 'max_features': 5, 'bootstrap': True}
In [186]:
# Note: the randomized search reported n_estimators=1000 as best; 500 is refit here
bgcl = BaggingRegressor(n_estimators=500, bootstrap= True, max_features=5)
bgcl = bgcl.fit(X_train,y_train)
print("bgcl on train data ", bgcl.score(X_train,y_train))
print("bgcl on validation data ", bgcl.score(X_val,y_val))
print("bgcl on test data ", bgcl.score(X_test,y_test))
#print("out of bag score" , bgcl.oob_score_)
bgcl on train data  0.9759246663040058
bgcl on validation data  0.8859127871413491
bgcl on test data  0.8941688407578552

Gradient Boosting Regression using RandomizedSearchCV

In [170]:
gbr_rs_param_grid = {
    'n_estimators' : [50, 100, 200, 400, 500, 1000],
    'max_depth': range(4,7),
    'criterion': ['mse','mae'],
    'min_samples_leaf' : sp_randint(1, 8),
    'max_features':sp_randint(1,6),
    'loss' : ['ls', 'lad', 'huber', 'quantile'],
    'learning_rate' : [0.001, 0.01, 0.05,0.1, 0.2,0.3]
}
In [171]:
# run randomized search
samples = 1500  # number of random samples 
randomCV = RandomizedSearchCV(gbr, param_distributions=gbr_rs_param_grid, n_iter=samples) #default cv = 3
In [172]:
randomCV.fit(X_train, y_train)

 
print(randomCV.best_params_)
{'criterion': 'mse', 'learning_rate': 0.05, 'loss': 'huber', 'max_depth': 5, 'max_features': 2, 'min_samples_leaf': 7, 'n_estimators': 1000}
In [174]:
gbmTree = GradientBoostingRegressor( criterion='mse', learning_rate = 0.05, loss= 'huber', max_depth= 5, 
                                    max_features= 2, min_samples_leaf= 7, n_estimators= 1000)
gbmTree.fit(X_train,y_train)
print("gbmTree on training" , gbmTree.score(X_train, y_train))
print("gbmTree on validation data ",gbmTree.score(X_val,y_val))
print("gbmTree on test data ",gbmTree.score(X_test,y_test))
gbmTree on training 0.9911083970930431
gbmTree on validation data  0.9297078867084143
gbmTree on test data  0.93635651472253
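With 1000 boosting stages it is worth checking whether fewer stages would already suffice. A sketch using `staged_predict` on synthetic data (not the concrete dataset) to trace held-out R^2 as stages accumulate:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                random_state=0)
gbm.fit(X_tr, y_tr)

# staged_predict yields one prediction array after each boosting stage
scores = [r2_score(y_te, pred) for pred in gbm.staged_predict(X_te)]
print("best stage:", int(np.argmax(scores)) + 1)
```

If the best stage is well below `n_estimators`, a smaller ensemble (or early stopping) gives the same quality at a fraction of the training cost.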

Random Forest Regression using RandomizedSearchCV

In [180]:
rfr_rs_param_grid = {
    'n_estimators' : [50, 100, 200, 400, 500, 1000],
    'max_depth': range(5,10),
    'criterion': ['mse','mae'],
    'min_samples_leaf' : sp_randint(1, 5),
    'max_features':sp_randint(1, 6)
}
In [181]:
# run randomized search
samples = 1000  # number of random samples 
randomCV = RandomizedSearchCV(rfr, param_distributions=rfr_rs_param_grid, n_iter=samples) #default cv = 3
In [182]:
randomCV.fit(X_train, y_train)

 
print(randomCV.best_params_)
{'criterion': 'mse', 'max_depth': 9, 'max_features': 5, 'min_samples_leaf': 1, 'n_estimators': 500}
In [183]:
rfTree = RandomForestRegressor(n_estimators=500, max_depth=9, max_features=5 ,min_samples_leaf=1, criterion='mse')
rfTree.fit(X_train,y_train)
print("rfTree on train data ", rfTree.score(X_train,y_train))
print("rfTree on validation data ", rfTree.score(X_val,y_val))
print("rfTree on test data ", rfTree.score(X_test,y_test))
rfTree on train data  0.9744523764583404
rfTree on validation data  0.9056178097487761
rfTree on test data  0.9026463591610014

SVR using RandomizedSearchCV

In [179]:
svreg_rs_param_grid = {
    'C' : [0.01, 0.1 , 1, 10,20, 30 , 50 , 100, 200, 300, 500],
    'gamma' : [0.01, 0.1,0.05,0.5,1,10,20, 'auto', 'scale']
}
In [60]:
# run randomized search
samples = 99  # number of random samples 
randomCV = RandomizedSearchCV(svreg, param_distributions=svreg_rs_param_grid, n_iter=samples) #default cv = 3
In [61]:
randomCV.fit(X_train, y_train)

 
print(randomCV.best_params_)
{'gamma': 0.1, 'C': 500}
In [185]:
svregressor = SVR( C=500, gamma=0.1)
svregressor.fit(X_train, y_train)
print("SVR on train data ", svregressor.score(X_train,y_train))
print("SVR on validation data ", svregressor.score(X_val,y_val))
print("SVR on test data ", svregressor.score(X_test,y_test))
SVR on train data  0.9189461858661077
SVR on validation data  0.885118145616623
SVR on test data  0.8769309533239076
In [252]:
y_test
Out[252]:
339    47.78
244    48.79
882    33.70
567    18.28
923    14.99
       ...  
258    42.29
551    49.77
528    32.40
812    24.66
50     49.25
Name: strength, Length: 206, dtype: float64





Gradient Boosting is the model to rely on, as it achieves the best R2 on both the validation and test sets.

Take a look at the feature importances

In [264]:
# View a list of the features and their importance scores
importances = gbmTree.feature_importances_
indices = np.argsort(importances)[::-1][:6]
a = cm_df.columns
features = a.drop(['strength','coarseagg','ash','cement','wc_ratio'])  # Index.drop takes only labels, no axis argument
#plot it
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
Out[264]:
Text(0.5, 0, 'Relative Importance')
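Impurity-based `feature_importances_` can be biased toward features with many split points, so permutation importance (available in scikit-learn >= 0.22 via `sklearn.inspection`) is a useful cross-check. A sketch on synthetic data rather than `cm_df`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0,
                                 random_state=0)
gbm = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)

# Shuffle each column n_repeats times and measure the resulting drop in R^2
result = permutation_importance(gbm, X_demo, y_demo, n_repeats=5,
                                random_state=0)
order = np.argsort(result.importances_mean)[::-1]
print("features ranked by permutation importance:", order)
```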



Bootstrap Sampling

In [253]:
Xdf = cm_df.drop(['coarseagg','ash','cement','wc_ratio'], axis = 1)
In [254]:
strength = Xdf['strength']
Xdf.drop(labels=['strength'], axis=1, inplace=True)
Xdf.insert(6,'strength',strength )
In [255]:
scale = StandardScaler() # Standard scaling
scale.fit(Xdf.iloc[:,:-1]) # fit the scaler on the predictor columns (the target column is excluded)
Out[255]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [256]:
Xdf.iloc[:,:-1] = scale.transform(Xdf.iloc[:,:-1])
In [257]:
Xdf.head()
Out[257]:
slag water superplastic fineagg age wb_ratio strength
0 1.623619 1.039191 -1.070393 -0.312220 -0.279733 1.862131 29.89
1 -0.367670 -1.099900 0.812800 0.287507 -0.501465 0.755956 23.51
2 -0.862561 0.277258 -0.111360 1.104745 -0.279733 0.180374 29.22
3 0.474347 2.198654 -1.070393 -1.299192 -0.279733 0.524486 45.85
4 1.288219 0.556475 0.516371 -0.963496 -0.279733 1.499312 18.29
In [258]:
val = Xdf.values
In [259]:
len(Xdf)
Out[259]:
1030
In [260]:
val
Out[260]:
array([[ 1.62361886,  1.03919055, -1.0703929 , ..., -0.27973311,
         1.86213089, 29.89      ],
       [-0.36767014, -1.09990047,  0.81279977, ..., -0.50146528,
         0.75595577, 23.51      ],
       [-0.86256058,  0.27725769, -0.1113596 , ..., -0.27973311,
         0.18037422, 29.22      ],
       ...,
       [ 0.49780176, -0.09187749,  0.48149736, ..., -0.27973311,
        -0.17916054, 44.28      ],
       [-0.41692464,  2.1986536 , -1.0703929 , ...,  3.55306569,
        -0.11532026, 55.06      ],
       [-0.86256058, -0.40422264, -1.0703929 , ..., -0.61233136,
        -1.75140176, 52.61      ]])
In [261]:
# configure bootstrap
n_iterations = 1000              # Number of bootstrap samples to create
n_size = int(len(Xdf) * 0.70)    # each bootstrap sample draws 70% of the data, with replacement

# run bootstrap
stats = list()
for i in range(n_iterations):
	# prepare train and test sets
	train = resample(val, n_samples=n_size)  # Sampling with replacement 
	test = np.array([x for x in val if x.tolist() not in train.tolist()])  # picking rest of the data not considered in sample
    # fit model
	model = GradientBoostingRegressor( criterion='mse', learning_rate = 0.05, loss= 'huber', max_depth= 5, 
                                    max_features= 2, min_samples_leaf= 7, n_estimators= 1000)
	model.fit(train[:,:-1], train[:,-1])
    # evaluate model
	#predictions = model.predict(test[:,:-1])
	score = model.score(test[:,:-1], test[:,-1])    # R^2 on the rows left out of this bootstrap sample
	print(score)
	stats.append(score)
0.909245264111418
0.9214779856058664
0.9087612709686288
...
0.9074642610146628
0.9313344444093021
0.8917255028573122
(1000 bootstrap R^2 scores were printed in full in the original run; the output is truncated here for readability)
In [263]:
# plot scores
plt.hist(stats)
plt.show()
# confidence intervals
alpha = 0.95                             # for 95% confidence 
p = ((1.0-alpha)/2.0) * 100              # lower-tail percentile: 2.5% in each tail
lower = max(0.0, np.percentile(stats, p))  
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
print('%.3f confidence interval %.3f%% and %.3f%%' % (alpha*100, lower*100, upper*100))
95.000 confidence interval 88.948% and 93.078%
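The percentile interval computed above can be wrapped in a small helper; a sketch using synthetic scores (the `stats` list itself is not reused here):

```python
import numpy as np

def percentile_ci(scores, alpha=0.95):
    """Empirical (percentile) confidence interval for bootstrap scores."""
    lower_p = (1.0 - alpha) / 2.0 * 100            # e.g. the 2.5th percentile
    upper_p = (alpha + (1.0 - alpha) / 2.0) * 100  # e.g. the 97.5th percentile
    return np.percentile(scores, lower_p), np.percentile(scores, upper_p)

# Synthetic scores centred near the bootstrap mean observed above
rng = np.random.default_rng(0)
demo_scores = rng.normal(loc=0.91, scale=0.01, size=1000)
low, high = percentile_ci(demo_scores)
print("95%% CI: %.3f to %.3f" % (low, high))
```

Unlike the normal approximation used for the K-Fold interval below it makes no distributional assumption, which matters when the score distribution is skewed.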



K-Fold Cross Validation

In [187]:
X = cm_df.drop(['strength','coarseagg','ash','cement','wc_ratio'], axis = 1)
y = cm_df['strength']
In [188]:
X = X.apply(zscore)   # assign the result: apply() returns a new DataFrame and leaves X unchanged otherwise
X
Out[188]:
slag water superplastic fineagg age wb_ratio
0 1.623619 1.039191 -1.070393 -0.312220 -0.279733 1.862131
1 -0.367670 -1.099900 0.812800 0.287507 -0.501465 0.755956
2 -0.862561 0.277258 -0.111360 1.104745 -0.279733 0.180374
3 0.474347 2.198654 -1.070393 -1.299192 -0.279733 0.524486
4 1.288219 0.556475 0.516371 -0.963496 -0.279733 1.499312
... ... ... ... ... ... ...
1025 -0.862561 -0.072947 0.673304 0.398148 -0.279733 1.666996
1026 -0.862561 -1.880763 3.009858 1.513364 -0.675683 -2.083786
1027 0.497802 -0.091877 0.481497 -0.063277 -0.279733 -0.179161
1028 -0.416925 2.198654 -1.070393 -1.299192 3.553066 -0.115320
1029 -0.862561 -0.404223 -1.070393 -2.015847 -0.612331 -1.751402

1030 rows × 6 columns

Using whole dataset for K-Fold Validation

In [189]:
kfold = KFold(n_splits=10)
model = GradientBoostingRegressor( criterion='mse', learning_rate = 0.05, loss= 'huber', max_depth= 5, 
                                    max_features= 2, min_samples_leaf= 7, n_estimators= 1000)
results = cross_val_score(model, X, y, cv=kfold)
print(results)
print("Mean R^2: %.3f%% (std %.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.90491341 0.94436604 0.96851674 0.94776957 0.90254233 0.95842333
 0.9517671  0.95685998 0.95863712 0.93096004]
Mean R^2: 94.248% (std 2.158%)
In [190]:
# To find interval at 95% confidence
std95 = results.std() * 1.96
min_interval = results.mean() - std95
max_interval = results.mean() + std95
In [195]:
print('We can inform stakeholders that, at 95% confidence, the model R2 lies between {min:.3f}% and {max:.3f}%'.
      format(min = min_interval*100.0, max = max_interval*100.0))
We can inform stakeholders that, at 95% confidence, the model R2 lies between 90.018% and 98.477%





K-Fold cross validation reports the higher mean R2 (about 94.2%, against the bootstrap interval of roughly 88.9% to 93.1%), so it is the more favourable estimate to share with stakeholders.